The Efficacy of Written Corrective Feedback:
A Critique of a Meta-analysis
John Truscott 
National Tsing Hua University, Taiwan 


Abstract 
An influential meta-analysis on the effectiveness of written error correction (Kang & Han, 
2015) concluded that the practice is valuable for language classes. This paper critically 
examines the meta-analysis and challenges its conclusion. The average effect size of the 21 
included studies is unimpressive, even if taken at face value. The three studies that carry most 
of the weight for the favorable conclusion are all essentially the same experiment, an 
experiment which is too narrow and specialized to support any general conclusions on the 
value of correction and the findings of which are challenged by other research. Two other 
studies obtained moderately large effect sizes: In one of them the effect disappeared within 
two weeks; the other was not a study of second language learning. For all the remaining 
studies modest or weak results were reported. Some obtained negative effects which were 
reported in the meta-analysis as positive effects. Others relied on inappropriate comparisons. 
Two relevant studies that found correction ineffective or harmful were inappropriately 
excluded from the meta-analysis. Several of the included studies fail on the authors’ inclusion 
criteria and should not have been used. The paper also examines some issues that arise in a 
meta-analysis on this topic and offers suggestions for future work. 
Keywords: written error correction, meta-analysis, effect size, inclusion criteria, control group


The effectiveness of error correction for improving learners’ writing skills is an 
important issue for language teachers and so, not surprisingly, has inspired a great deal of 
research. An important tool for understanding this body of research is meta-analysis (Cohen, 
1992; Lipsey & Wilson, 2001; Norris & Ortega, 2000; Rosenthal, 1991). Its value lies in its 
ability to bring the results of different studies together and place them in a common form, 
effect size, so they can be compared and averaged. A large number of meta-analyses have 
been done on error correction research, reaching a variety of conclusions (see Plonsky & 
Brown, 2015; Truscott, 2016). 

The primary meta-analysis dealing specifically with written correction is that of Kang 
and Han (2015), published in the Modern Language Journal. Those authors looked at 21 
studies, concluding that writing instructors can take their findings as a favorable message 
about the effectiveness of written correction. Because of the importance of this conclusion for 
teaching practice, critical analysis is necessary. This is the purpose of this paper. I will 
suggest that the authors’ positive conclusion is unwarranted. In the process, I will consider 
various issues arising in the use of meta-analysis and in original research on this topic. I will 
not attempt here to provide an alternative meta-analysis, a project that would be extremely 
ambitious and would go well beyond the goals of the paper. 


The Meta-analysis 


The effect size measure that has been most commonly used in this area is Cohen’s d, 
which is the difference between the means of two groups divided by their pooled standard 
deviation. So if the mean score of a group that received correction is one standard deviation 
better than the mean of a group that did not, the effect size is 1.00. Kang and Han (2015) used 
Hedges’ g, which follows the same principle but is more conservative and adjusts for small
sample sizes. Interpretation of effect sizes, both d and g, is based on the following 
benchmarks (Plonsky & Oswald, 2014): 


large effect: 1.00 
medium effect: .70 
small effect: .40


Negative effect sizes indicate that the comparison group outperformed the experimental 
group; i.e., they point to harmful effects of the treatment. 
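As a concrete illustration of the two measures described above, the following Python sketch computes Cohen's d from group summaries and then applies a small-sample correction to obtain Hedges' g. The function names and the sample figures (means, standard deviations, group sizes) are hypothetical and are not taken from any study discussed here. The correction factor 1 - 3/(4df - 1) is the commonly used approximation; whether Kang and Han (2015) computed g in exactly this way is an assumption.

```python
import math

def cohens_d(mean_exp, mean_ctl, sd_exp, sd_ctl, n_exp, n_ctl):
    """Difference between group means divided by the pooled standard deviation."""
    pooled_var = ((n_exp - 1) * sd_exp**2 + (n_ctl - 1) * sd_ctl**2) / (n_exp + n_ctl - 2)
    return (mean_exp - mean_ctl) / math.sqrt(pooled_var)

def hedges_g(d, n_exp, n_ctl):
    """Shrink d slightly using the common small-sample correction (an assumed approximation)."""
    df = n_exp + n_ctl - 2
    return d * (1 - 3 / (4 * df - 1))

# Hypothetical groups: the corrected group scores one standard deviation
# above the uncorrected group, with 20 learners in each group.
d = cohens_d(11.0, 10.0, 1.0, 1.0, 20, 20)
g = hedges_g(d, 20, 20)
print(round(d, 3), round(g, 3))
```

With these made-up numbers d is exactly 1.00 and g is slightly smaller, which is the sense in which g is the more conservative measure. A negative d or g would indicate that the uncorrected group scored higher.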

Table 1 lists the studies that were included in the meta-analysis along with the 
reported effect size for each (from Appendix C of Kang & Han, 2015). I have arranged them 
in descending order and divided them in terms of their relation to the above benchmarks. One 
complication is that Kang and Han’s table is apparently missing one of the studies they used, 
as it lists them from 1 to 21 but does not include a number 16. I will briefly return to the 
missing study (apparently Sheen, 2010) below.


Study                            Effect size (g)

Bitchener (2008)                     1.482
Bitchener & Knoch (2008)             1.375
Bitchener & Knoch (2010b)            1.161

Shintani & Ellis (2013)               .902
van Beuningen et al. (2008)           .888

Bitchener & Knoch (2010a)             .642
Hartshorn et al. (2010)               .607
Sheen et al. (2009)                   .570
Chandler (2003), Study 1              .496
Fazio (2001)                          .481
Evans et al. (2011)                   .473
Sun (2013)                            .472
Ellis et al. (2008)                   .430

Kepner (1991)                         .383
Jhowry (2010)                         .341
Mubarak (2013)                        .245
Sheen (2007)                          .104
Bitchener et al. (2005)               .103
Semke (1980)                          .089
Truscott & Hsu (2008)                 .068

Table 1. Effect sizes reported by Kang and Han (2015)


Some of the numbers can be challenged, as can Kang and Han’s (2015) decisions 
about which studies and which numbers to include. But before getting into these detailed 
points, it is worthwhile to consider the findings at face value. 

First, the overall effect size the authors reported was .54, meaning that the effects of 
correction were slightly closer to the “small” benchmark than to the “medium” benchmark. It 
is appropriate to ask whether a finding like this can justify a favorable recommendation to 
teachers. Returning to the table, the first thing to note is that it is dominated by effects that 
range from unimpressive to essentially non-existent. Of the 20 effect sizes shown, 15 did not 
reach the benchmark for medium effect, nearly all of them falling well short of it; 7 of these 
obtained effects that fall short even of the “small” benchmark, again well short in most cases. 
Thus, any favorable conclusions drawn from the meta-analysis necessarily depend on the 5 
studies (1/4 of the sample) that reported better results, especially the 3 lying above the “large” 
benchmark. These will therefore be considered in more detail in the two following sections, 
after which I will turn more briefly to the remaining studies. 
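The tallies in the preceding paragraph can be checked mechanically. The sketch below is only an illustration: it takes the 20 effect sizes as listed in Table 1 (read as decimals) and counts how many fall below each benchmark. The variable and function names are hypothetical, and a value exactly equal to a benchmark is treated as reaching it.

```python
# Effect sizes (g) transcribed from Table 1, in descending order.
effect_sizes = [1.482, 1.375, 1.161, 0.902, 0.888, 0.642, 0.607, 0.570, 0.496, 0.481,
                0.473, 0.472, 0.430, 0.383, 0.341, 0.245, 0.104, 0.103, 0.089, 0.068]

# Benchmarks from Plonsky and Oswald (2014), as quoted earlier in this section.
LARGE, MEDIUM, SMALL = 1.00, 0.70, 0.40

def count_below(values, threshold):
    """Number of effect sizes that fall short of the given benchmark."""
    return sum(1 for v in values if v < threshold)

below_medium = count_below(effect_sizes, MEDIUM)
below_small = count_below(effect_sizes, SMALL)
at_least_medium = len(effect_sizes) - below_medium
at_least_large = len(effect_sizes) - count_below(effect_sizes, LARGE)
print(below_medium, below_small, at_least_medium, at_least_large)
```

The counts agree with the figures given in the text: 15 effect sizes below the medium benchmark, 7 of them below the small benchmark, 5 at or above medium, and 3 at or above large.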


The Three Studies that Obtained Large Effect Sizes


First, the sample contains three studies that yielded large effect sizes. All three came 
from the research of Bitchener and Knoch (Bitchener, 2008; Bitchener & Knoch, 2008, 
2010b). By my calculation, if these and the fourth member of this group (Bitchener & Knoch, 
2010a) were removed, the overall effect size would fall below the “small” benchmark. Thus, 
favorable conclusions about the value of correction rest on this research, and its limitations 
are therefore limitations on any such conclusions. 

The first limitation is that the experiments reported in the four papers are virtually 
identical. The work can reasonably be seen as one original study with replications applying it 
to different groups.¹ The inclusion in a meta-analysis of each as a distinct study is not wrong,
but we have to recognize that the three large effect sizes, with their very substantial influence 
on the overall effect size, reflect a narrow base. 

What then is special about these studies? Why did they obtain findings so much 
stronger than others? The main answer is readily apparent: The researchers deliberately 
selected as the target of correction a single, very simple error type: the use of a for first 
mention of a noun and the for subsequent mentions (“I read a book today; the book was about 
linguistics”). The writing tasks were then designed to support the focus on this one aspect of 
grammar. The testing consisted of those same tasks. It was done in the same context, by the 
same researcher; in other words, the learners were being repeatedly reminded of the 
corrections they had received. This treatment should be expected not only to keep the 
information fresh but, perhaps more importantly, to lead the corrected students to pay greater 
attention to that particular grammar point during the testing, potentially introducing a 
significant bias. 

Under these conditions it is hardly surprising that strong results were obtained. The 
question is what such results tell us about the value of correcting errors in writing classes. 
They tell us, possibly, that if we select one very simple point to correct and design writing 
assignments to support correction of that one point, then afterward the students will probably 
write more accurately on that point when they are doing writing tasks that are built around it 
and the context encourages special attention to it. These are limitations of the Bitchener and 
Knoch studies and therefore limitations on any favorable conclusions drawn from the meta- 
analysis. 

Even this is probably too optimistic an assessment of this research, though, as 
questions can be raised about the more general impact of the treatment on learners’ ability to 
write accurately. A general problem with the teaching of grammar points is that it can easily 
result in over-application of what has been taught, leading to increased errors (e.g. 
Lightbown, 1983; Pica, 1983; Weinert, 1987). The grammar principle that is the focus of the 
Bitchener and Knoch studies predominates in the tasks/tests that were used (because the tasks 
were designed for that purpose) but its role is considerably smaller in English article usage in 
general. Learners who need to be corrected for failure to follow it in the carefully selected 
contexts used in these studies are presumably learners who would have difficulty judging 
when other factors overrule it. Thus, the treatment might well encourage them to make 
mistakes by applying it where it is not appropriate (cf. Ellis et al., 2008, Note 2). These
studies have not been concerned with such negative influences on learning, picking out only 
the positive effects. 


¹ I also find it difficult to judge the extent to which Bitchener (2008), Bitchener and Knoch
(2008), and Bitchener and Knoch (2010a) are distinct (non-overlapping) studies.

Ekiert and di Gennaro (2019) obtained evidence that harmful effects do occur. Their
conceptual replication of Bitchener and Knoch (2010a) looked at other uses of English 
articles, in addition to those targeted by the correction, and found that the correction groups’ 
scores on these other uses were consistently below those of the control group, with effect 
sizes (g) ranging from -.01 to -1.06; in other words, the correction appeared to harm other 
aspects of article use. Limitations of narrow studies like the Bitchener and Knoch research 
are also suggested by the findings of Mubarak (2013), in which comprehensive correction 
resulted in negligible gains in the general accuracy of English article use, despite extensive 
correction of article errors. Similarly, Pashazadeh and Marefat (2010), in an uncontrolled 
study, targeted “the entire article system of English” and found that substantial gains on an 
immediate posttest became negligible four weeks later and negative after an additional four- 
week delay. 

Other findings raise doubts about the value of correction even for the specific article 
uses on which Bitchener and Knoch found dramatic improvements. Shintani and Ellis (2013), 
studying the first-mention function of the English indefinite article, found that direct 
correction had no effect on accurate use and that metalinguistic feedback had only immediate 
effects, disappearing within two weeks. Ellis, Sheen, Murakami, and Takashima (2008), 
looking at both first and subsequent mention, obtained more favorable results, based entirely 
on the rather puzzling finding that the performance of one correction group improved 
dramatically during a two-week period of no treatment. Additional challenges to the 
Bitchener and Knoch findings come from Sheen, Wright, and Moldawa (2009), which I will 
consider below. 

More recently, Ekiert and di Gennaro (2019) found that while their correction groups 
did improve on the targeted uses, the control group improved more: The effect sizes are all 
negative, ranging from -.15 to -.55. The correction, in other words, was ineffective and 
possibly harmful, even for the very simple target error on which it focused. The limitations 
on these findings are that the study sacrificed some validity by using a form of story retelling 
instead of a more realistic task and, especially, combining results of this test and an error 
correction task to use as the measure. On the other hand, these choices seem most likely to 
benefit the correction groups, as the measure is more open to the explicit knowledge that the 
treatment probably produced. 


Thus, the three studies that yielded large effect sizes, and carry most of the burden in 
any favorable conclusions from the meta-analysis, actually have little to tell us about the 
general value of correction, and may not be informative even about its value in the particular 
narrow context in which it was used in those studies. They have little to offer to teachers who 
are deciding whether to correct in their classes. 


The Two Cases of Moderately Large Effect Sizes 


While favorable conclusions from the meta-analysis rest mainly on the large effect 
sizes of the Bitchener and Knoch research, two other studies yielded effect sizes not very far 
below the “large” benchmark. They also require a closer look here. 

For Shintani and Ellis (2013), first, the relatively good effect size shown in the table 
(g = .902) is quite misleading. It appears in the meta-analysis because Kang and Han (2015) 
used only immediate posttests, excluding data from follow-up testing. While this decision can 
certainly be defended, the effect in this case is to hide the main finding of the study. The second
posttest of Shintani and Ellis, given just two weeks after the treatment, found no significant 
advantages for corrected learners and yielded a small g. The authors concluded that “the 
effect was not durable” (p. 286). When its most important finding is recognized, this turns out
to be a study that found correction ineffective. It should also be noted, again, that this was a
very narrowly-focused study, looking specifically at the first-mention use of English a. 

Interestingly, Ellis, Sheen, Murakami, and Takashima (2008) showed the opposite 
pattern to that found by Shintani and Ellis (2013), with weak results on the immediate 
measure and very strong results on the delayed test. Thus, if delayed posttest results are used 
the effect size for this study becomes far greater than the .430 listed in the table. Strong 
results here are perhaps not surprising, as this study had the same narrow focus as the 
Bitchener studies and so most of the comments above apply here as well. Control issues also 
appear to be present in this experiment: (a) the description of participants suggests that the 
classes used as experimental groups consisted of superior students; (b) during the course of 
the study these students were taking a reading class while the control group’s class was on 
oral communication; and (c) substantial differences were found on the pretest, favoring the 
experimental groups. So, again, it is probably not surprising to see strong effects. But this 
leaves the mystery of why weak results immediately after the treatment turned into 
outstanding results after a two-week period of no treatment. 

The other moderately large effect size came from van Beuningen, de Jong, and 
Kuiken (2008). Given Kang and Han’s (2015) inclusion criteria, as well as generally accepted 
thinking in the area, this study should not have been included in the meta-analysis. According 
to the authors’ description of their participants, around 20% were native speakers of the target 
language, Dutch, and “most students were born in The Netherlands” but “many of them only 
started learning Dutch in school (i.e. at age four)” [emphasis added], meaning about 10 years 
prior to the study. In other words the authors’ description of their participants suggests that 
this was not a study of second language learning. 

This point is further clarified in van Beuningen, de Jong, and Kuiken (2012). The 
2008 paper is in fact a report of the pilot study that was done for this main experiment. In the 
later paper the authors make it clear that they were not concerned with the distinction 
between L1 and L2 learners. They included students whose writing was considered weak, 
without regard to the language background of those students (see especially their Note 1). 
Their concession to the L1-L2 distinction was to reanalyze their data without the students 
who came from families that used only Dutch at home, with the result that the findings did 
not change. But this additional analysis still included an unknown and probably quite large 
number of students who could not reasonably be classified as L2 learners: those for whom 
Dutch was one of the home languages, those who had significant very early exposure to 
Dutch outside the home, and those who had acquired native or near-native knowledge 
through school experience, starting at age 4. 

Kang and Han (2015) excluded this main study on the grounds that the effect size it 
produced was an outlier, five standard deviations above the overall average of their sample. 
This conclusion appears to reflect a misreading of the results. van Beuningen et al. (2012) 
used as their measures both a revision task, in which learners used the corrections they were 
given to revise their assignment, and new writing tasks. Kang and Han’s stated policy was, 
appropriately, to use only data from new writings. But the extreme effect size they obtained 
for this study could only have come from the revision data; the new writings showed only 
moderate gains (for both measures, see Table 3 of van Beuningen et al., 2012; also Tables 4 
and 5). So it appears that the decision to exclude the study as an outlier was a mistake. But 
while it should not have been excluded for this reason, its exclusion was nonetheless 
appropriate, as the authors’ description of their participants makes it clear that this was not a 
study of second language learning. 


The Remaining Studies 


The conclusion to this point is that none of the five studies for which substantial effect 
sizes were reported actually provide any meaningful support for the use of error correction in 
second language writing instruction. I turn now to the remaining studies. 

First, the number reported in Table 1 for Fazio (2001) is incorrect. The effect size is 
listed as .481, indicating a small positive effect. But in fact the study found a negative effect; 
i.e., the performance of the correction groups was poorer than that of the no-correction 
(commentaries) group. The number .481 appears to have come from a confusion between two 
groups. Fazio used both native and non-native speakers, reporting their results separately. 
The meta-analysis should, of course, use the non-natives (Fazio’s Table 1), but Kang and Han 
appear to have used the results for the native speakers (Table 2). In any case, the number for 
this study should be negative. 

A similar problem arises, but in a somewhat more confused form, with Jhowry 
(2010). Kang and Han (2015) list the effect size as .341. But in Jhowry’s main analysis, 
presented in her Table 2, the posttest score for the control group is higher than that of the 
correction group, so the g should be negative. This score represents the total number of 
correct uses of the forms divided by the total number of uses (p. 27). To confuse things, the 
charts in Jhowry’s Appendix C portray the results of another measure, error rates, and here 
the correction group is noticeably better than the control group, though no specific numbers 
are reported. The author tentatively attributed this finding to the adoption of an avoidance 
strategy by corrected students; i.e., they made fewer errors with the target forms because they 
limited their use of those forms. I have doubts about the inclusion of this study, due to 
vagueness in Jhowry’s description of the treatment and the scoring, along with the 
uncertainty created by contrasts between the main analysis and the error rates. But if its 
results are to be included the effect size has to be negative. 

One requirement for inclusion of a study in the meta-analysis was that it had to use a 
group that received no error feedback. The inclusion of Chandler (2003) represents a 
deviation from this policy, as Kang and Han (2015) more or less concede (Note 4), because 
the group identified as control group did receive such feedback, differing from the 
experimental group only in not being required to put it to use until after the study. Hartshorn 
et al. (2010) and Evans et al. (2011) also lacked the necessary no-correction group. These 
studies compared the authors’ novel version of correction, “dynamic written corrective 
feedback”, to “traditional process writing instruction”, using a comparison group which 
received “a wide variety of feedback on the linguistic accuracy of what they produced” 
(Hartshorn et al., p. 95). Neither study should be a part of the meta-analysis. Sun (2013) is a 
borderline case. The control group received comments like “Pay attention to conjugation of 
pl/sing. verbs”, raising doubts about whether the study should have been included. 

Control issues appear in a different form with Sheen (2007) and Sheen, Wright, and 
Moldawa (2009). The issue here is the proper identification of control groups. The treatment 
given the correction groups involved not only correction but also a reading and writing task 
designed to present the target forms to the learners and give them practice in using them. The 
group that was labeled “control” did not receive this treatment. The comparison made was 
thus the combination of the task + correction vs. the absence of both. For the purpose of 
determining the effect of correction, this is not a legitimate comparison — the control group 
should have been given the same task, just without corrective feedback on it. The implication 
is that Sheen (2007) had no valid comparison group. The same is true for Sheen (2010), 
which can be identified as the missing study in Table 1, based on Kang and Han’s (2015) 
Appendix B. These studies did not meet the requirements for inclusion in the meta-analysis. 


The closely related study of Sheen, Wright, and Moldawa (2009) is more interesting. 
In addition to the groups used by Sheen (2007), they included a group which was given the 
reading and writing task but received no feedback on their writing. This group, labeled the 
“writing practice” group, provides a valid comparison for measuring the effects of correction: 
It is the actual control group of this study. Interestingly, this point seems to have been 
recognized by Ellis, Sheen, Murakami, and Takashima (2008) in their closely related 
experiment. They used the same correction groups as well as the “writing practice” group but 
properly identified the latter as the control group and did without the condition that was 
labeled “control” by Sheen (2007) and Sheen, Wright, and Moldawa. For Sheen, Wright, and 
Moldawa, then, it should be clear that the calculation of the effect size should use the scores 
of the “writing practice” group and not the “control” group. This could be done in two ways. 
If effect sizes for the two correction groups, focused and unfocused, are averaged, the 
resulting g is a tiny positive number, far below the reported .570. But it is more enlightening 
to separate the correction groups, as the focused portion of the experiment is another study on 
first use vs. subsequent use of English articles and therefore of only limited interest. The 
unfocused group (which was actually a “somewhat focused” group) is the more interesting of 
the two. It yields negative effect sizes; i.e., the correction appears to have been harmful. 

A different sort of control problem comes up with Bitchener, Young, and Cameron 
(2005). During the study one of the two correction groups received 20 hours per week of 
language instruction and the other 10 hours, while the comparison group had only 4. The 
negligible effect size produced by this study thus came from a comparison that was biased in 
favor of corrected students. 


I will conclude this section with a brief summary of the studies with effect sizes 
below the “medium” benchmark but above the “small” benchmark; i.e. those that fall in the 
middle ground between the top five and the seven that obtained very weak results. Bitchener 
and Knoch (2010a), first, is another instance of the narrowly focused work considered above, 
with little to offer writing instructors. Ellis et al. (2008), with its large effect size on a delayed 
posttest, falls into the same category.² Hartshorn et al. (2010), Chandler (2003), and Evans et
al. (2011) lacked a no-correction group and should have been excluded. Sun (2013), with an 
effect size only slightly above the “small” benchmark, is a marginal member of this group. 
Fazio (2001) actually yields a negative effect size. That for Sheen et al. (2009) is at best a 
tiny positive number or, more appropriately, a negative number, depending on how the 
calculation is done. Altogether, there is nothing in these studies to support a favorable view 
of correction, and two of them argue strongly against such a view. 


Exclusion of Relevant Studies 


Kang and Han (2015) stated that a study was to be excluded if “the effects of 
feedback could not be isolated from that of other treatments such as conferences (e.g., Polio, 
Fleck, & Leder, 1998; Sheppard, 1992)” (p. 4). It is unclear, though, what things should count 
as “other treatments”. Conferences in which the teacher discusses the feedback with students, 
as in Sheppard (1992), seem to me an integral part of the feedback process as it is commonly 
done in writing classes, not an extra factor contaminating the data. If discussion of feedback 
is taken as a separate treatment, one might ask why revision following feedback, which is 
commonly treated as an independent variable in this research area, does not also qualify as an
“other treatment”. To the best of my knowledge no one (including Kang and Han) thinks that
a study should be excluded because students rewrote their work after receiving feedback.
There is no apparent reason why conferences should be treated differently. The same question
can be raised regarding Hartshorn et al.’s (2010) practice of requiring students to maintain “a
comprehensive inventory of the errors they produce along with the written context in which
they are produced” (p. 88). Why was this not classified as an additional, contaminating
treatment?

² Ellis et al. (2008), in contrast to Sheen et al. (2009), did not report results for the
comparison between the unfocused group and the control group on the errors targeted for the
former, data that would give it more general relevance.

It is also unclear how the ban on studies using conferences was applied. Mubarak’s 
(2013) treatment included discussion with individual students while they were writing plus 
whole-class comments on errors and peer discussion of the errors. This study, included in the 
meta-analysis, would appear to fit the criteria for exclusion. For Hartshorn et al. (2010), 
“Classroom discussions and activities were centered on the most frequent types of errors 
being produced by the students in their daily writing” (pp. 94-95). A similar approach appears 
to have been taken in the closely related study of Evans et al. (2011). It is not clear why these 
studies were not excluded. 

Beyond the lack of clarity and the apparent inconsistency in its application, an 
exclusion policy like this one seems counterproductive. A meta-analysis should serve the 
interests of the field, meaning in this case that it should provide teachers with information 
they can use in deciding to correct or not to correct in their classes. This decision needs to be 
based on the effects of correction within genuine teaching contexts, not on what it does in 
isolation from the practices that naturally and properly accompany it. Sheppard (1992) 
provides such information, as do Polio, Fleck, and Leder (1998). The information in these 
cases is that correction, in the realistic teaching contexts used in these studies, is harmful 
(Sheppard) or has no effect at all (Polio, Fleck, & Leder). This information should be 
included in a meta-analysis that is concerned with the effectiveness of error correction, 
particularly one that offers advice to writing instructors. 


Discussion 


In this section I will first summarize the above critique. As many of the problems 
stem more from the nature of meta-analysis than from this particular use of it, I will then 
extend the critique to meta-analysis in general. The final sub-section will turn to some issues 
in the research itself, focusing on the use of controls. 


Summary of the Critique 

Table 2 summarizes the main points raised above regarding the various studies 
included by Kang and Han (2015). They point to a number of adjustments that I believe 
should be made if we are to understand the literature that Kang and Han set out to synthesize. 

Most importantly, the studies that narrowly focused on first use vs. subsequent use of 
English articles should be treated as a separate category. This category includes Bitchener 
(2008) and Bitchener and Knoch (2008, 2010a, 2010b), as well as some of the findings (not 
all) from the related studies of Ellis et al. (2008) and Sheen, Wright and Moldawa (2009), and 
perhaps Shintani and Ellis (2013) and Ekiert and di Gennaro (2019). Note that I am not 
suggesting that this type of research is entirely pointless or that the findings should be 
discarded. A teacher might be interested in teaching this particular function of a and the and 
want to see relevant evidence on its effectiveness. In terms of the case against correction, this 
research might be seen as a pursuit of the “special, hypothetical circumstances under which 
correction might not be a bad idea” (Truscott, 1999, p. 121). The essential point here is that 
its (very large) limitations must be recognized. 








Study                        Effect    Comments
                             size (g)

Bitchener (2008)             1.482     targeted only one, very simple feature; tasks/tests
                                       were designed specifically for that feature; corrected
                                       students likely paid more attention to it during the
                                       testing; did not look at possible harmful effects;
                                       results are challenged by later research

Bitchener & Knoch (2008)     1.375     essentially the same study as Bitchener (2008), with
                                       same limitations

Bitchener & Knoch (2010b)    1.161     essentially the same study as Bitchener (2008), with
                                       same limitations

Shintani & Ellis (2013)      .902      this number comes from the immediate posttest; a
                                       test 2 weeks later showed small effects; “the effect
                                       was not durable”

van Beuningen et al. (2008)  .888      not a study of L2 learning; should be excluded

Bitchener & Knoch (2010a)    .642      essentially the same study as Bitchener (2008), with
                                       same limitations

Hartshorn et al. (2010)      .607      lacked a no-correction group; should be excluded

Sheen et al. (2009)          .570      apparently based on an invalid comparison; the
                                       appropriate comparison yields a tiny g; one
                                       correction group’s treatment was of the B&K type;
                                       the other group (“unfocused”) yields a negative g

Chandler (2003), Study 1     .496      lacked a no-correction group; should be excluded

Fazio (2001)                 .481      no-correction group outperformed both correction
                                       groups; this number should be negative

Evans et al. (2011)          .473      lacked a no-correction group; should be excluded

Sun (2013)                   .472      comparison group received feedback like “Pay
                                       attention to conjugation of pl/sing. verbs”; its
                                       inclusion is questionable

Ellis et al. (2008)          .430      small immediate effect with large delayed effect;
                                       same narrow focus as Bitchener (2008) with similar
                                       limitations

Kepner (1991)                .383

Jhowry (2010)                .341      this should be a negative g; description of treatment
                                       and scoring is limited; inclusion is questionable

Mubarak (2013)               .245      two measures of general accuracy yielded
                                       conflicting results; measures of tense and article
                                       accuracy gave weak results

Sheen (2007)                 .104      represents the combined effect of correction and a
                                       treatment designed to demonstrate the target use and
                                       provide practice with it; should be excluded (but the
                                       failure of correction despite the bias is noteworthy)

Bitchener et al. (2005)      .103      comes from a comparison that was biased in favor
                                       of corrected learners; should be excluded (but the
                                       failure of correction despite the bias is noteworthy)

Semke (1980)                 .089

Truscott & Hsu (2008)        .068      one-shot treatment, not intended to directly test the
                                       effectiveness of correction





Table 2. Comments on the studies included by Kang and Han (2015) 
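
The way inclusion decisions move a summary number can be illustrated with a rough sensitivity check on the values in Table 2. The sketch below is only an illustration under loudly stated assumptions: it uses an unweighted mean (a real meta-analysis weights studies, so these figures are not comparable to Kang and Han's average), it drops only the rows whose comment says "should be excluded", and it treats "should be negative" as a sign flip of the same magnitude, which the table does not confirm. The names `STUDIES`, `mean_g`, `kept`, and `kept_flipped` are hypothetical.

```python
# Effect sizes (g) transcribed from Table 2.
# Flags follow the table's comments:
#   exclude  -> comment says "should be excluded"
#   negative -> comment says the reported g should be negative
# ASSUMPTIONS: equal weighting of studies, and sign flips that keep the
# reported magnitude. Neither is the procedure of the original meta-analysis.
STUDIES = [
    # (study, g, exclude, negative)
    ("Bitchener (2008)", 1.482, False, False),
    ("Bitchener & Knoch (2008)", 1.375, False, False),
    ("Bitchener & Knoch (2010b)", 1.161, False, False),
    ("Shintani & Ellis (2013)", 0.902, False, False),
    ("van Beuningen et al. (2008)", 0.888, True, False),
    ("Bitchener & Knoch (2010a)", 0.642, False, False),
    ("Hartshorn et al. (2010)", 0.607, True, False),
    ("Sheen et al. (2009)", 0.570, False, False),
    ("Chandler (2003), Study 1", 0.496, True, False),
    ("Fazio (2001)", 0.481, False, True),
    ("Evans et al. (2011)", 0.473, True, False),
    ("Sun (2013)", 0.472, False, False),
    ("Ellis et al. (2008)", 0.430, False, False),
    ("Kepner (1991)", 0.383, False, False),
    ("Jhowry (2010)", 0.341, False, True),
    ("Mubarak (2013)", 0.245, False, False),
    ("Sheen (2007)", 0.104, True, False),
    ("Bitchener et al. (2005)", 0.103, True, False),
    ("Semke (1980)", 0.089, False, False),
    ("Truscott & Hsu (2008)", 0.068, False, False),
]


def mean_g(rows):
    """Unweighted mean of the g values in rows."""
    return sum(g for _, g, _, _ in rows) / len(rows)


# Scenario 1: every row of the table, taken at face value.
# Scenario 2: rows marked "should be excluded" are dropped.
kept = [row for row in STUDIES if not row[2]]
# Scenario 3: additionally flip the sign where the table says g should be negative.
kept_flipped = [(name, -g if neg else g, ex, neg) for name, g, ex, neg in kept]

for label, rows in [
    ("all rows", STUDIES),
    ("after exclusions", kept),
    ("after exclusions and sign flips", kept_flipped),
]:
    print(f"{label}: n = {len(rows)}, unweighted mean g = {mean_g(rows):.3f}")
```

A check of this kind shows only that the summary figure depends on the inclusion and sign decisions. It cannot capture the qualitative objections in the Comments column, such as several rows being essentially the same experiment.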





I suggest that a meta-analysis on written error correction should also include the 
findings of a number of additional studies, beginning with Sheppard (1992) and Polio, Fleck, 
and Leder (1998). Several studies not mentioned by Kang and Han (2015) should also be 
considered, some of them appearing after the meta-analysis was completed, others 
unpublished or appearing in relatively obscure sources. My tentative list of candidates 
includes the following: Nakazawa (2006), Baldwin (2008), Muñoz (2011), Khanlarzadeh and 
Nemati (2016), and Bonilla López et al. (2018). Also interesting is Nicolas-Conesa, 
Manchón, and Cerezo (2019), though the use of a one-shot treatment limits its 
meaningfulness. Two other potentially useful studies (Robb, Ross, & Shortreed, 1986; Karim 
& Nassaji, 2018) did not provide the necessary information for effect size calculation and so 
cannot be included in a meta-analysis, but their findings should nonetheless be recognized. In 
this group of ten additional studies, all except one (Bonilla López et al., 2018) obtained 
results that were quite weak and in some cases negative, which is to say the correction 
appeared to be harmful. 
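
To make concrete what “the necessary information for effect size calculation” involves, here is a minimal sketch of the standard Hedges' g computation from each group's mean, standard deviation, and sample size. The function name and the example numbers are hypothetical, and I am assuming the usual pooled-SD formula with the small-sample correction; the values reported in the meta-analysis may have been computed differently.

```python
import math


def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized mean difference (treatment minus control) with
    Hedges' small-sample correction.

    Requires means, standard deviations, and sample sizes for both
    groups; a study that omits any of these cannot supply a g.
    """
    df = n_t + n_c - 2
    # Pooled standard deviation of the two groups.
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    d = (mean_t - mean_c) / pooled_sd
    # Small-sample correction factor (approximate form).
    j = 1 - 3 / (4 * df - 1)
    return j * d


# Hypothetical accuracy scores: corrected group vs. no-correction group.
g_positive = hedges_g(mean_t=10.0, sd_t=2.0, n_t=20, mean_c=9.0, sd_c=2.0, n_c=20)

# Same numbers with the groups' means swapped: the no-correction group
# now outperforms the corrected group, so g comes out negative.
g_negative = hedges_g(mean_t=9.0, sd_t=2.0, n_t=20, mean_c=10.0, sd_c=2.0, n_c=20)

print(f"g when corrected group scores higher: {g_positive:.3f}")
print(f"g when no-correction group scores higher: {g_negative:.3f}")
```

The sign of g follows the direction of the subtraction, so a study in which the uncorrected learners did better should enter a meta-analysis with a negative value.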

I will not attempt the very large task of providing an alternative meta-analysis here, 
one that would incorporate these studies and, in principle, deal with all the often difficult 
issues raised above. My goal in this paper is to examine one influential meta-analysis, 
showing that its findings cannot support the claim that written error correction is effective, 
and to raise a number of issues that come up in a meta-analysis on this topic and in the 
research that provides the data for it. 


Some General Limitations of Meta-analysis 

While meta-analysis is a valuable tool, its limitations must be recognized. Subjective 
judgments are difficult if not impossible to avoid in establishing and applying inclusion 
criteria and in deciding what number(s) to use when a study includes multiple measures. This 
subjectivity can lead to a variety of different conclusions, even if we set aside the (very 
significant) issue of reviewer bias (see Truscott, 2016). 

Another concern is that the judgments, no matter how fairly and properly they are 
made, can have the effect of hiding important information. Kang and Han’s (2015) decision 
to use only immediate posttests was reasonable, but the consequence is that some important 
information gets lost. Results from delayed tests are more meaningful than those from 
immediate tests, and the two are often quite different. The striking contrasts found by Ellis et 
al. (2008) and by Shintani and Ellis (2013) are cases in point (but are by no means the only 
examples). The effect size reported for a given study cannot be understood without reference 
to longer-term effects. 

This is not the only way that important findings can get lost in a meta-analysis. A 
single study will often include various measures, and decisions have to be made about how to 
deal with them. Kang and Han (2015) adopted the reasonable policy of using only one 
number per study, but this policy, again, results in the loss of interesting information. Sheen 
et al. (2009) provides an example. The study looked at the effects of two treatments: correction 
narrowly focused on a specific use of English articles (first vs. subsequent mention) and a type 
of correction that is potentially of more general value. Each by itself can provide interesting information (one 
more so than the other), as can the comparison between them. Averaging them together has 
the effect of removing this information in favor of a single number that is essentially 
meaningless. For this case the solution is, again, to treat the two types of information as 
separate targets of meta-analysis. 
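
The loss of information from averaging can be shown with a small sketch. The two effect sizes below are invented purely for illustration and are not taken from Sheen et al. (2009) or any other study; `summarize_arms` is a hypothetical helper.

```python
def summarize_arms(arm_effects):
    """Return the single averaged g together with what the average hides."""
    values = list(arm_effects.values())
    combined = sum(values) / len(values)
    # True when the arms point in opposite directions.
    mixed_signs = min(values) < 0 < max(values)
    return {"combined_g": combined, "per_arm": arm_effects, "mixed_signs": mixed_signs}


# HYPOTHETICAL values: one narrowly focused arm, one unfocused arm.
summary = summarize_arms({"focused (articles only)": 0.80, "unfocused": -0.20})

print(f"single number per study: {summary['combined_g']:.2f}")
print(f"arms disagree in direction: {summary['mixed_signs']}")
```

Under these invented values, the single number suggests a modest benefit, although one arm shows a clear gain and the other a loss. Reporting the arms separately keeps that contrast visible.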

The striking inconsistencies in the findings of Mubarak (2013) provide another 
example. General accuracy was measured in two ways: error rates showed strong effects (at 
least on the immediate posttest) while error-free T-units showed weak results. The author also 
measured specific accuracy, in terms of error-free T-units, on tense and article use, which 
were perhaps the most common errors in the study. The results were very weak, raising the 
crucial (but unaddressed) question of what error types were responsible for the strong effects 
found for general accuracy on the error-rate measure. These inconsistencies and open issues 
need to be recognized and considered. The reliance on a single number, in order to meet the 
requirements of the meta-analysis, is not a good way to understand the results of this study. 


Control Issues in Error Correction Research 

Issues repeatedly arose in the above discussion regarding the nature and use of control 
groups — issues for both researchers and reviewers. One can ask, first, whether experimental 
and control groups are truly comparable in some of the studies, a necessity if any meaningful 
conclusions are to be drawn from the comparison. An example of the problem is Bitchener, 
Young and Cameron (2005), in which the control group received only a small fraction of the 
overall language instruction received by the correction groups during the study. The 
questions raised above about the groups used by Ellis, Sheen, Murakami, and Takashima 
(2008) point to other examples of possible problems. The use of control conditions by Sheen 
(2007, 2010) and Sheen, Wright, and Moldawa (2009) is simply wrong. 

Another crucial issue is whether the comparison group used in a study was a genuine 
no-correction group, a condition generally considered necessary in this research area. Should 
a meta-analysis include a study like Sun (2013), in which the comparison group received at 
the end of their assignments some fairly specific comments on errors of language form? Kang 
and Han’s (2015) defense of their decision to include Chandler (2003), despite the extensive 
error correction provided to the comparison group, seems to me untenable. Hartshorn et al. 
(2010) and Evans et al. (2011), both included in the meta-analysis, were not interested in 
using a no-correction group; their goal was to show that one type of error correction is 
superior to another. 

The limitations resulting from the lack of a genuine no-correction group, or the use of 
a questionable one, have to be recognized by both authors and reviewers. Findings like those 
of Hartshorn et al. (2010) and Evans et al. (2011), for example, may have significant 
implications for teachers who are already committed to correcting their learners’ errors and 
are now looking for the best way to do it, but they have nothing to offer to a teacher who is 
undecided about whether to correct or not to correct. It is important to avoid the fallacy of 
citing such experiments as evidence on this more fundamental question. 

One more issue regarding control groups should be considered. Truscott (2003) 
argued that the absence of correction is not inherently demotivating and can in fact have 
positive effects on students’ attitudes. But if the conditions of an experiment encourage the 
uncorrected students in the intuitive belief that correction would benefit them, this might be 
expected to negatively influence their performance. They could be led to feel that they are 
being denied valuable help which other students are receiving — that they are being cheated. 
So when favorable results are reported for corrected groups relative to a no-correction group, 
we need to ask exactly how the latter was treated and how this treatment might have 
influenced students’ attitudes and therefore their performance. This question becomes 
especially significant when positive results obtained in a study owe much of their strength to 
declines in the control group’s performance over the course of the study, as in the cases of 
van Beuningen et al. (2008) and Bonilla López et al. (2018). 

The issue here is the validity of the comparisons that are made in the experiments. 
The ultimate pedagogical issue behind this research is whether teachers should correct in 
their classes. A teacher who decides not to is presumably one who feels that teaching without 
correction is a good idea, and who will show this belief to the students and possibly explain 
to them the reasons for it. Thus, if the findings of a study are to be applied to real teaching, 
the researcher should try to create these conditions for the control groups. At the very least, 
researchers should be careful to avoid encouraging in uncorrected students a negative view of 
the treatment they are receiving. To the best of my knowledge, this point has never been 
addressed in any research reports, so its significance is an open question. 


Conclusion 


Kang and Han (2015) conclude with a statement that their findings provide “a clear 
message for L2 writing instructors, that written corrective feedback can improve the 
grammatical accuracy of student writing” (p. 12). I have argued here that their findings in fact 
offer no basis for such a message. The argument took the form of a detailed critique rather 
than an alternative meta-analysis, because the former is, I believe, the more effective way to 
make the point. It also has to be recognized that any meta-analysis on this topic inevitably 
incorporates many specific assumptions on potentially contentious questions, questions that 
need to be explicitly addressed in their own right. 

While meta-analysis can be a valuable tool, it has important limitations. Its results are 
inevitably influenced, if not shaped, by subjective judgments that the reviewer has to make. 
Its requirements can tie the reviewer’s hands, potentially resulting in questionable inclusion 
of some information and questionable exclusion of other information. These factors readily 
lead to debatable and somewhat simplistic summaries of the research findings in the field. It 
should come as no surprise then that meta-analyses of error correction research have reached 
wildly differing conclusions (see Plonsky & Brown, 2015; Truscott, 2016). The moral is that 
the numbers that come out of a meta-analysis and the conclusions offered by its authors 
should never be taken simply at face value; they must be subjected to critical analysis, of the 
sort I have offered here. 